LTX 2 Improve encode_video by Accepting More Input Types #13057
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
sayakpaul left a comment
Thanks! Left some comments.
video_np = video.cpu().numpy()
"""
Encodes a video with audio using the PyAV library. Based on code from the original LTX-2 repo:
https://github.com/Lightricks/LTX-2/blob/main/packages/ltx-pipelines/src/ltx_pipelines/utils/media_io.py
(nit): let's put a permalink.
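For context, the PyAV-based encoding referred to in the excerpt above follows the usual open-container / encode-frame / mux-packet pattern. Below is a minimal standalone sketch of that pattern, not the PR's implementation; the codec, frame count, resolution, and frame rate are made up.

```python
import av
import numpy as np

# Dummy uint8 RGB frames standing in for a decoded video (values are random).
frames = np.random.randint(0, 256, (24, 256, 256, 3), dtype=np.uint8)

with av.open("out.mp4", mode="w") as container:
    stream = container.add_stream("libx264", rate=24)
    stream.width, stream.height = 256, 256
    stream.pix_fmt = "yuv420p"
    for frame_array in frames:
        frame = av.VideoFrame.from_ndarray(frame_array, format="rgb24")
        for packet in stream.encode(frame):
            container.mux(packet)
    # Flush any packets still buffered in the encoder.
    for packet in stream.encode():
        container.mux(packet)
```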
        The sampling rate of the audio waveform. For LTX 2, this is typically 24000 (24 kHz).
    output_path (`str`):
        The path to save the encoded video to.
    video_chunks_number (`int`, *optional*, defaults to `1`):
The original LTX-2 code will use a video_chunks_number calculated from the video VAE tiling config, for example in two-stage inference. For the default num_frames value of 121 and the default tiling config TilingConfig.default(), I believe this works out to 3 chunks. The idea seems to be that the chunks correspond to each tiled stride when decoding.
In practice, I haven't had any issues with the current code, which is equivalent to using just one chunk. I don't fully understand why the original code supports this; my guess is that it is useful for very long videos or when there are compute constraints.
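For illustration only, a chunk count of this kind could fall out of a temporal stride roughly as in the sketch below; the function name, the stride argument, and the stride value are assumptions, not the actual TilingConfig API.

```python
import math


def estimate_video_chunks_number(num_frames: int, temporal_stride_frames: int) -> int:
    # Hypothetical rule: one chunk per temporal stride of the tiled VAE decode.
    return math.ceil(num_frames / temporal_stride_frames)


# With 121 frames and an assumed stride of 41 frames this gives 3 chunks,
# matching the figure mentioned above; the real stride value may differ.
print(estimate_video_chunks_number(121, 41))  # 3
```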
See #13057 (comment) for discussion about some complications for supporting video_chunks_number.
    video = torch.from_numpy(video)
elif isinstance(video, np.ndarray):
    # Pipeline output_type="np"
    video = (video * 255).round().astype("uint8")
Should we check for the value range before doing this? Just to be safe.
for video_chunk in tqdm(all_tiles(first_chunk, video), total=video_chunks_number):
    video_chunk_cpu = video_chunk.to("cpu").numpy()
    for frame_array in video_chunk_cpu:
        frame = av.VideoFrame.from_ndarray(frame_array, format="rgb24")
Should we let the users control this format? 👀
I think we could allow the users to specify the format, but this would be in tension with value checking as suggested in #13057 (comment): for example, if we always convert denormalized inputs with values in [0, 1] to uint8 values in [0, 255], that conversion would only be correct for some pixel formats.
We could conditionally convert based on the supplied video_format, but my understanding is that there are a lot of video formats, and I don't think we can anticipate all of the use cases that users may have. So I think we could support a video_format argument with a "use at your own risk" caveat:
elif isinstance(video, np.ndarray):
    # Pipeline output_type="np"
    is_denormalized = np.logical_and(np.zeros_like(video) <= video, video <= np.ones_like(video))
    if np.all(is_denormalized) and video_format == "rgb24":
        video = (video * 255).round().astype("uint8")
    else:
        logger.warning(
            f"The video will be encoded using the input `video` values as-is with format {video_format}. Make sure"
            " the values are in the proper range for the supplied format."
        )
    video = torch.from_numpy(video)

An alternative would be to only support "rgb24" as the original LTX-2 code does, with the idea that power users can use their own video encoding code if they have a different use case.
EDIT: the right terminology here might be "pixel format" rather than "video format".
> An alternative would be to only support "rgb24" as the original LTX-2 code does with the idea that power users can use their own video encoding code if they have a different use case.

Okay, let's go with this.
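Putting the two threads together, here is a minimal sketch of the agreed behavior, assuming a fixed "rgb24" pixel format plus the value-range check suggested earlier; the helper name is made up and the exact checks in the merged PR may differ.

```python
import numpy as np
import torch


def _np_video_to_uint8_tensor(video: np.ndarray) -> torch.Tensor:
    # Only "rgb24" is supported, so float inputs are expected to be in [0, 1].
    if video.dtype != np.uint8:
        if video.min() < 0.0 or video.max() > 1.0:
            raise ValueError("Expected float `video` values in [0, 1] for rgb24 encoding.")
        video = (video * 255).round().astype("uint8")
    return torch.from_numpy(video)
```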
    yield first_chunk
    yield from tiles_generator


for video_chunk in tqdm(all_tiles(first_chunk, video), total=video_chunks_number):
WDYT of getting rid of all_tiles() and doing it like so?
from itertools import chain

for video_chunk in tqdm(chain([first_chunk], video), total=video_chunks_number):
    video_chunk_cpu = video_chunk.to("cpu").numpy()
    for frame_array in video_chunk_cpu:
        frame = av.VideoFrame.from_ndarray(frame_array, format="rgb24")
        for packet in stream.encode(frame):
            container.mux(packet)
This does the right thing but appears not to work well with tqdm, which doesn't update properly from the chain object:
33%|████████████████████████████████▋ | 1/3 [00:04<00:09, 4.57s/it]
Actually, I think #13057 (comment) is wrong: as we generally supply a single torch.Tensor to encode_video (for example from a pipeline output), this line creates an iterator with one element:
diffusers/src/diffusers/pipelines/ltx2/export_utils.py, lines 146 to 149 in 857735f
So when we call next(video) in the following line, the video iterator is exhausted. Even if we set video_chunks_number > 1 in this case, our for loop through first_chunk and video will only yield one element in total, whether that's using all_tiles or chain. Thus, the progress bar will end up being wrong, since we tell tqdm that we have video_chunks_number > 1 elements when we in fact only have one.
I think the underlying difference is that the original LTX 2 code will return an iterator over decoded tiles when performing tiled VAE decoding, whereas we will return the whole decoded output as a single tensor with the tiles stitched back together. So maybe it doesn't make sense to support video_chunks_number as this will only work well when we supply an Iterator[torch.Tensor] to encode_video (in the current implementation).
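For illustration, the exhaustion described above can be reproduced in isolation; the shapes are made up and the `iter([...])` wrapper stands in for the referenced `export_utils` lines.

```python
from itertools import chain

import torch

decoded = torch.rand(121, 64, 64, 3)  # a single decoded video tensor, tiles already stitched
video = iter([decoded])               # stand-in for the "iterator with one element"
first_chunk = next(video)             # consumes the only element, exhausting `video`

# Whether via all_tiles() or chain(), only one chunk is ever yielded, so a
# progress bar created with total=video_chunks_number > 1 never completes.
chunks = list(chain([first_chunk], video))
print(len(chunks))  # 1
```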
What does this PR do?
This PR improves the `diffusers.pipelines.ltx2.export_utils.encode_video` function for exporting videos with audio by allowing it to accept input types besides `torch.Tensor`, such as `np.ndarray` and `List[PIL.Image.Image]`. This is meant to make it easier to use, as users will no longer have to do special processing, such as manually converting the output to a uint8 `torch.Tensor` (see the sketch after the changelist), before calling `encode_video`; the PR handles this logic inside `encode_video`. (Note that applying such processing will still work after the change, but will no longer be necessary.)

Changelist

- Support `np.ndarray` and `List[PIL.Image.Image]` `video` inputs for `encode_video`, which are assumed to be outputs from pipelines using `output_type`s of `"np"` and `"pil"`, respectively.
- … `encode_video`.
- … `encode_video` without any special processing.
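For illustration, the manual conversion that is no longer needed looks roughly like this; the array shape is made up and the conversion lines mirror the diff excerpts above.

```python
import numpy as np
import torch

# Stand-in for a pipeline output produced with output_type="np".
video = np.random.rand(121, 512, 768, 3).astype(np.float32)

# Previously required before calling encode_video:
video = (video * 255).round().astype("uint8")
video = torch.from_numpy(video)

# After this PR, the np.ndarray (or a List[PIL.Image.Image]) can be passed to
# encode_video directly and this conversion happens internally.
```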
Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.
@sayakpaul
@yiyixuxu